ABSTRACT Pearson’s chi-squared test can detect outliers in the data distribution of a given set of histograms. 
However, in fields such as demographics (e.g. birth years), outliers may be more easily found in terms 
of the histogram smoothness, where techniques such as Whipple’s or Myers’ indices successfully handle 
only specific anomalies. This paper proposes smoothness outlier detection among histograms by using 
the relation between their discrete total variations (DTV) and their respective sample sizes. This relation 
is mathematically derived to be applicable in all cases and simplified by an accurate linear model. The 
deviation of the histogram’s DTV from the value predicted by the model is used as the outlier score and the 
proposed method is named Total Variation Outlier Recognizer (TVOR). TVOR requires no prior assumptions 
about the histograms’ samples’ distribution, it has no hyperparameters that require tuning, it is not limited to 
only specific patterns, and it is applicable to histograms with the same bins. Each bin can have an arbitrary 
interval that can also be unbounded. TVOR finds DTV outliers easier than Pearson’s chi-squared test. In 
case of distribution outliers, the opposite holds. TVOR is tested on real census data and it successfully finds 
suspicious histograms. The source code is given at https://github.com/DiscreteTotalVariation/TVOR. 
I. INTRODUCTION 
Outliers can be defined as data patterns that do not conform 
to an expected normal data behavior [1]. Since identifying 
outliers or anomalies can often be useful, performing outlier, 
i.e. anomaly, detection has an important role in many data 
related areas. For example, with the ever growing application 
of machine learning in various fields, having clean training 
sets, free of any unwanted outliers, can often significantly 
benefit the final production accuracy. On the other hand, 
in real-time applications such as network traffic or health 
monitoring, it is usually highly important to detect anomalies 
that could represent any form of unwanted behavior to prevent 
their potentially detrimental effects. Alternatively, it may be 
required to see which samples differ the most from the rest of 
the data and study them in more detail. 

Since there is a relatively high demand for anomaly and 
outlier detection methods in fields dealing with some form 
of data, numerous methods have been proposed for various 
applications, as can be seen in several review papers [1]-[3]. 

A particular kind of data are histograms. First introduced 
by Pearson [4], histograms are by definition estimates of the 
probability distribution of a continuous variable. If there is 
a sample of real numbers drawn from the same distribution 
and all inside a given interval, then histograms can be used 
as their simple representation, and are also suitable for visual 
presentation. For histograms to be useful, the bin size should 
be adjusted according to the data being described [5]-[8]. 
In certain cases for a group of such histograms it may be inter- 
esting to know whether some of them are outliers. This may 
include histograms describing samples drawn from another 
distribution different from the one of the majority of the sam- 
ples, but it may also include histograms just describing some 
less likely samples from the same distribution. To be clear, 
in such a case, histograms are not used as tools for outlier 
detection like in e.g. [9], but they are the data representations 
to be analyzed for the presence of outliers. 

In the simple case when only a single histogram is given, 
instead of multiple histograms, a straightforward approach 


This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ 


IEEE Access 


N. Banić, N. Elezović: TVOR: Finding DTV Outliers Among Histograms 


to check whether it represents a sample that differs from a 
given distribution would be to use Pearson’s chi-squared 
test [10]. It tests how likely it is that any observed difference 
between the bin counts of the given histogram and the 
expected bin counts occurred by chance. However, for this to 
work, it is required to know the expected bin counts. 

On the other hand, if multiple histograms are given for 
samples that are assumed to have been drawn from the 
same distribution, then it is possible to find outliers among 
them by means of Pearson’s chi-squared test even if the 
distribution is unknown. Namely, under the Glivenko-Cantelli 
theorem [11] all the given histograms, except the currently 
tested one, can be used to get a reliable empirical distribution 
function, which in turn can be used to get the expected bin 
counts. Over time, numerous other techniques that can be 
applied in the described cases have been proposed [12]-[15]. 

While the problem of finding outliers in terms of distribu- 
tion is common, in some cases it is required to find histogram 
outliers in terms of some specific histogram property. For 
example, census data histograms are usually smooth, i.e. the 
difference between the counts of neighboring bins is rela- 
tively low, but in the presence of anomalies such as age heap- 
ing [16], this often stops being the case. One way to measure 
smoothness is to calculate total variation [17]. This means 
that by detecting deviations from the expected total variation 
it could be possible to detect smoothness outliers more easily 
than by means of some of the previously described tech- 
niques. Single-value properties similar to total variation in 
terms of simplicity, such as skewness, have already been used 
for outlier detection [18]. As a matter of fact, total variation 
has also found application in tasks such as classification [19] 
and outlier detection for graph signals [20]. 

Therefore, in this paper a new method for outlier detection 
in terms of discrete total variation (DTV) among histograms 
that describe samples drawn from the supposedly same, but 
unknown distribution is proposed. There are several contri- 
butions of this paper. First, it is mathematically proven that 
in terms of the underlying distribution there are only two 
possible cases of the relation between the sample size and 
the expected discrete total variation with the first case only 
being a special case of the second one. Second, a method 
is proposed that utilizes this relation to detect outliers that 
deviate from their expected discrete total variation. Third, it 
is shown that while the proposed method is not supposed to 
be used as a general outlier detector in terms of distribution, 
in some special cases it still performs better in this task than 
Pearson’s chi-squared test. Fourth, the proposed method is 
shown to be able to detect suspicious histograms on real-life 
census data. The practical applicability and usefulness of the 
proposed method are shown on synthetic data and real-life 
census data. The proposed method is simple to implement and 
it does not require prior knowledge of any distribution. 

The paper is structured as follows: in Section II the total 
variation is formally described, in Section III the theoretical 
derivation of the proposed method and its underlying model 
are given, in Section IV the experimental results obtained 




on synthetic data and historical real-life census data are pre- 
sented and discussed, and Section V concludes the paper. 


II. THE TOTAL VARIATION 
Total variation of a differentiable function $f$ is defined as [17] 

$$\|f\|_V = \int_{-\infty}^{+\infty} |f'(t)| \, dt. \tag{1}$$

If $f$ is non-differentiable, its total variation is given as [17] 

$$\|f\|_V = \lim_{h \to 0} \int_{-\infty}^{+\infty} \frac{|f(t) - f(t - h)|}{|h|} \, dt. \tag{2}$$

If $f_n[i] = f * \Phi_n(i/n)$ is a discrete signal obtained with an 
averaging filter $\Phi_n(t) = \mathbf{1}_{[0, n^{-1}]}(t)$ and a uniform sampling 
at intervals $n^{-1}$, then its discrete total variation (DTV) is 
calculated by approximating the signal derivative by a finite 
difference over the sampling distance $h = n^{-1}$ and replacing 
the integral in Eq. (2) by a Riemann sum, which then gives [17] 

$$\|f_n\|_V = \sum_{i} |f_n[i] - f_n[i - 1]|. \tag{3}$$
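
As an illustration of Eq. (3) applied to histogram bin counts, the following minimal Python sketch computes the discrete total variation of two toy histograms with the same sample size. The function name `discrete_total_variation` and the two example histograms are assumptions made only for this illustration and are not taken from the TVOR source code.

```python
import numpy as np


def discrete_total_variation(hist):
    """Sum of absolute differences between neighboring bins, as in Eq. (3)."""
    counts = np.asarray(hist, dtype=float)
    return float(np.abs(np.diff(counts)).sum())


# Two hypothetical histograms with the same sample size (87 values each).
smooth = [10, 12, 14, 15, 14, 12, 10]
heaped = [10, 20, 6, 23, 6, 12, 10]  # spiky, e.g. as caused by heaping

dtv_smooth = discrete_total_variation(smooth)
dtv_heaped = discrete_total_variation(heaped)
print(dtv_smooth, dtv_heaped)
```

The spiky histogram has a much higher discrete total variation than the smooth one even though both describe samples of the same size, which is the property the rest of the paper builds on.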


Despite being relatively simple to calculate, total variation 
is successfully used in areas such as denoising [21]-[24], 
image restoration [25]-[28], image super-resolution [29], 
[30], image enhancement [31], [32], compressive sensing 
applications [33], [34], computer graphics [35], and others. 


III. THE PROPOSED METHOD 

In this section, the proposed method for finding discrete 
total variation outliers among histograms and the method’s 
underlying model are described. In order to avoid 
any misunderstandings, the structure of this section has 
purposely been slightly extended. Section III-A gives the general 
idea of how to use the discrete total variation for outlier 
detection, Section III-B gives an initial statistical foundation, 
Sections III-C and III-D use this foundation to derive the 
relation between the sample size and its expected discrete 
total variation for two general cases, Section III-E uses this 
relation to propose the sample models based on the discrete 
total variation, Section III-F describes the score calculation, 
Section III-G explains how to combine all these results into a 
single method, and, finally, Section III-H names this method. 


A. THE GENERAL IDEA 
Let there be a sample of $N$ values, $\mathbf{x}_n$ its histogram with $n$ 
bins, and $x_i$ the number of values that fell in the $i$-th bin with 

$$\sum_{i=1}^{n} x_i = N. \tag{4}$$

Each of the bins has an arbitrary interval that can also be 
unbounded. The bins are not required to be of the same size. 
Let $p_i$ be the probability of a value falling in the $i$-th bin and 

$$\sum_{i=1}^{n} p_i = 1. \tag{5}$$

Due to randomness the discrete total variation of $\mathbf{x}_n$, i.e. 
$\|\mathbf{x}_n\|_V$, can differ for each sampling, but it should mostly not 
differ significantly from its expected value $E[\|\mathbf{x}_n\|_V]$ for a 
given $N$ and probabilities $p_i$. For a given $\mathbf{x}_n$ the difference 
between its $\|\mathbf{x}_n\|_V$ and $E[\|\mathbf{x}_n\|_V]$ can serve as a score of how 
much the sample differs from the expected behavior. Such a 
score has several drawbacks as well as advantages. 

The main disadvantage is that it is required to know 
$E[\|\mathbf{x}_n\|_V]$ for any given $N$ or at least to know the relation 
between these two values for proper scaling and comparison. 

The main advantage of such a scoring is the simplicity of 
its calculations due to the very definition of the discrete total 
variation. Further, because of that it is not necessary to know 
the desired sample distribution, which significantly widens 
the application possibilities. Finally, it is not very likely that 
two samples of the same size have histograms of the same 
or similar smoothness, i.e. discrete total variation and that 
their scores differ significantly. That means that if this score 
is calculated for every sample in a group of samples that are 
expected to have similar smoothness, then the ones with the 
highest scores can be considered as outlier candidates. 

However, in order for this to be practically usable, first an 
analytical relation between $N$ and $E[\|\mathbf{x}_n\|_V]$ has to be found. 


B. THE STATISTICAL BACKGROUND 
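The first step in finding a relation between $N$ and $E[\|\mathbf{x}_n\|_V]$ 
is to examine $E[(x_i - x_j)^2]$ in more detail by using the 
variances of $x_i$ and $x_j$, i.e. $\mathrm{Var}[x_i]$ and $\mathrm{Var}[x_j]$, respectively: 

$$\begin{aligned}
E\left[(x_i - x_j)^2\right]
&= E\left[\left((x_i - E[x_i]) - (x_j - E[x_j]) + (E[x_i] - E[x_j])\right)^2\right] \\
&= \mathrm{Var}[x_i] - 2E\left[(x_i - E[x_i])(x_j - E[x_j])\right] + \mathrm{Var}[x_j] + \left(E[x_i] - E[x_j]\right)^2.
\end{aligned} \tag{6}$$

The value of $x_i$ for a given $i$ has binomial distribution so that 

$$E[x_i] = N p_i, \tag{7}$$

$$\mathrm{Var}[x_i] = N p_i (1 - p_i). \tag{8}$$

For the second term of the last form of Eq. (6) it holds that 

$$E\left[(x_i - E[x_i])(x_j - E[x_j])\right] = E\left[(x_i - E[x_i])\, x_j\right]. \tag{9}$$

The result of Eq. (9) can now be further developed as follows: 

$$\begin{aligned}
E\left[(x_i - E[x_i])\, x_j\right]
&= E\left[E\left[(x_i - E[x_i])\, x_j \mid x_i\right]\right] \\
&= E\left[(x_i - E[x_i])\, E[x_j \mid x_i]\right] \\
&= E\left[(x_i - E[x_i])\, \frac{p_j}{1 - p_i} (N - x_i)\right] \\
&= -\frac{p_j}{1 - p_i} E\left[(x_i - E[x_i])\, x_i\right] \\
&= -\frac{p_j}{1 - p_i} \left(E[x_i^2] - E[x_i]^2\right) \\
&= -\frac{p_j}{1 - p_i} \mathrm{Var}[x_i] = -N p_i p_j.
\end{aligned} \tag{10}$$

Combining Eq. (7), Eq. (8), and Eq. (10) develops Eq. (6) to 

$$\begin{aligned}
E\left[(x_i - x_j)^2\right]
&= N p_i (1 - p_i) + 2 N p_i p_j + N p_j (1 - p_j) + N^2 (p_i - p_j)^2 \\
&= N^2 (p_i - p_j)^2 + N \left(p_i + p_j - (p_i - p_j)^2\right).
\end{aligned} \tag{11}$$

Based on the values of $p_i$ there are two cases of further actions 
for establishing a relation between $N$ and $E[\|\mathbf{x}_n\|_V]$. These 
two cases are covered in the following subsections. 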
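
The closed form for the expected squared difference of two bin counts in Eq. (11) can be checked numerically. The sketch below is only an illustration under assumed values: the probabilities `p`, the sample size `N`, the number of simulated histograms, and the helper name `expected_squared_difference` are all hypothetical choices and not part of the original experiments.

```python
import numpy as np


def expected_squared_difference(N, p_i, p_j):
    """Right-hand side of Eq. (11) for two bins with probabilities p_i and p_j."""
    return N**2 * (p_i - p_j) ** 2 + N * (p_i + p_j - (p_i - p_j) ** 2)


rng = np.random.default_rng(0)
p = np.array([0.1, 0.3, 0.4, 0.2])  # assumed bin probabilities
N = 500                             # assumed sample size

# Each row is one simulated histogram, i.e. one multinomial realization.
samples = rng.multinomial(N, p, size=200_000)
empirical = np.mean((samples[:, 0] - samples[:, 1]) ** 2.0)
theoretical = expected_squared_difference(N, p[0], p[1])
print(empirical, theoretical)
```

For these assumed values the right-hand side of Eq. (11) equals $500^2 \cdot 0.04 + 500 \cdot 0.36 = 10180$, and the simulated mean should lie close to it.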


C. UNIFORM DISTRIBUTION 

1) UPPER BOUND 

The first case is when the distribution of the sample and 
consequently the distribution of the histogram are uniform so 
that the probability of a value falling in the $i$-th bin is then 

$$p_i = \frac{1}{n}, \quad i \in \{1, 2, \ldots, n\}. \tag{12}$$

When this is applied to Eq. (11), it eliminates its first term 
and it simplifies its second term, which then gives the form 

$$E\left[(x_i - x_j)^2\right] = \frac{2N}{n}. \tag{13}$$

Taking into account that the square root is a concave function 
and applying Jensen’s inequality [36] to Eq. (13) gives 

$$E\left[|x_i - x_j|\right] = E\left[\sqrt{(x_i - x_j)^2}\right] \le \sqrt{E\left[(x_i - x_j)^2\right]}. \tag{14}$$

This inequality can then be applied to all neighboring bins: 

$$\sum_{i=1}^{n-1} E\left[|x_{i+1} - x_i|\right] \le \sum_{i=1}^{n-1} \sqrt{E\left[(x_{i+1} - x_i)^2\right]}. \tag{15}$$

Due to the basic properties of the expectation, it holds that 

$$\sum_{i=1}^{n-1} E\left[|x_{i+1} - x_i|\right] = E\left[\sum_{i=1}^{n-1} |x_{i+1} - x_i|\right]. \tag{16}$$

Applying Eq. (3), Eq. (13), and Eq. (16) to Eq. (15) gives 

$$E\left[\|\mathbf{x}_n\|_V\right] \le (n - 1) \sqrt{\frac{2N}{n}}. \tag{17}$$

This gives the upper bound for the expected value of the 
discrete total variation and thus the first relation between $N$ and 
$E[\|\mathbf{x}_n\|_V]$ if the sample numbers are uniformly distributed. 
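
To give a feeling for how loose the bound in Eq. (17) is, the following Python sketch compares it with a Monte Carlo estimate of the expected discrete total variation for uniformly distributed samples. The values of `n`, `N`, the number of simulated histograms, and the function name `uniform_dtv_upper_bound` are assumptions chosen only for this illustration.

```python
import numpy as np


def uniform_dtv_upper_bound(n, N):
    """Upper bound of Eq. (17) for n bins and sample size N."""
    return (n - 1) * np.sqrt(2.0 * N / n)


rng = np.random.default_rng(1)
n, N = 10, 1000  # assumed number of bins and sample size

# Simulated histograms of uniformly distributed samples.
hists = rng.multinomial(N, np.full(n, 1.0 / n), size=50_000)
mean_dtv = np.abs(np.diff(hists, axis=1)).sum(axis=1).mean()
bound = uniform_dtv_upper_bound(n, N)
print(mean_dtv, bound)
```

For these values the bound is $9\sqrt{200} \approx 127.3$, while the simulated mean should stay noticeably below it, which motivates looking for exact values and tighter approximations.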




2) EXACT VALUES 
Let $F(n, N)$ denote the expected value of the discrete total 
variation as a function of two key parameters $n$ and $N$: 

$$F(n, N) := E\left[\|\mathbf{x}_n\|_V\right]. \tag{18}$$

Theorem 1: The exact value of $F(2, N)$ in closed form is 

$$F(2, N) = 2^{-N+1} \left\lfloor \frac{N + 1}{2} \right\rfloor \binom{N}{\lfloor N/2 \rfloor}. \tag{19}$$

The proof of Theorem 1 is given later in the Appendix. 
It is relatively easy to show that for each $r$ it holds that 

$$F(2, 2r) = F(2, 2r - 1) \tag{20}$$

and this leads to some unwanted consequences later on in the 
paper, but there they are mentioned and handled properly. 
The case of uniform distribution means that a histogram 
is a realization of the multinomial distribution and its bins 
$x_1, x_2, \ldots, x_n$ are random variables. The distribution of each 
$x_i$ is $B(N, \frac{1}{n})$, i.e. it is binomially distributed with parameters 
$N$ and $\frac{1}{n}$. Variables $x_i$ are not independent, since their 
sum equals $N$. However, because of the symmetry, variables 
$x_2 - x_1, \ldots, x_n - x_{n-1}$ have the same distribution, which gives 

$$F(n, N) = E\left[|x_2 - x_1| + \cdots + |x_n - x_{n-1}|\right] = (n - 1) E\left[|x_2 - x_1|\right]. \tag{21}$$

Before continuing, for the sake of convenience, first the 
notation for the multinomial coefficient has to be given as 

$$\binom{N}{k_1, k_2, \ldots, k_m} = \frac{N!}{k_1! \, k_2! \cdots k_m!}. \tag{22}$$

Theorem 2: The expected value of the total variation of a 
histogram of uniformly distributed values is calculated as 

$$F(n, N) = 2(n - 1) \left(\frac{1}{n}\right)^{N} \sum_{\substack{k_1 + k_2 \le N \\ k_1 < k_2}} \binom{N}{k_1, k_2, N - k_1 - k_2} (n - 2)^{N - k_1 - k_2} (k_2 - k_1). \tag{23}$$

The proof of Theorem 2 is given later in the Appendix. By 
using Eq. (23) it is possible to calculate the expected total 
variation for all reasonable values of $n$ and $N$ with some 
examples being shown in Table 1. However, if using Eq. (23) 
turns out to be computationally too demanding, the solution 
is to develop and use some appropriate asymptotic forms. 
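
The exact values can be evaluated directly with exact rational arithmetic. The following Python sketch is an assumption-laden illustration and not the authors' implementation: `exact_f2` evaluates the closed form of Theorem 1 as reconstructed in Eq. (19), and `exact_f` evaluates the double sum of Theorem 2 as reconstructed in Eq. (23), with the convention $0^0 = 1$ for $n = 2$. Both function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def exact_f2(N):
    """Closed form of Eq. (19) for two bins, returned as an exact fraction."""
    return Fraction(((N + 1) // 2) * comb(N, N // 2), 2 ** (N - 1))


def exact_f(n, N):
    """Double sum of Eq. (23) over all pairs k1 < k2 with k1 + k2 <= N."""
    total = 0
    for k1 in range(N + 1):
        for k2 in range(k1 + 1, N - k1 + 1):
            rest = N - k1 - k2
            # Multinomial coefficient N! / (k1! k2! rest!) of Eq. (22).
            multinomial = comb(N, k1) * comb(N - k1, k2)
            total += multinomial * (n - 2) ** rest * (k2 - k1)
    return Fraction(2 * (n - 1) * total, n ** N)


print(float(exact_f2(100)), float(exact_f(3, 100)))
```

The double sum has on the order of $N^2/4$ terms, so it remains practical for moderate $N$, but the integers involved grow quickly, which is one reason why asymptotic forms are of interest.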


3) ASYMPTOTICS 
By taking into account the well-known asymptotic form of 
the central binomial coefficients that is commonly given as 

$$\binom{2r}{r} \sim \frac{4^r}{\sqrt{\pi r}} \quad \text{as } r \to \infty, \tag{24}$$

it follows that the asymptotic form of $F(2, N)$ is given as 

$$F(2, N) \sim \sqrt{\frac{2N}{\pi}}. \tag{25}$$

The experimental calculations suggest that the following 
hypothesis can be stipulated for the uniform distribution: 
Hypothesis 1: For $N$ sufficiently large, we have 

$$F(n, N) \approx (n - 1) F\left(2, \frac{2N}{n}\right). \tag{26}$$

The right side of this equation represents the sum of the 
discrete total variations of two-binned histograms of the 
uniform distribution with sample size being equal to the expected 
number of values. If this hypothesis is accepted, then the 
following asymptotic is true for the uniform distribution: 

$$F(n, N) \approx \frac{2(n - 1)}{\sqrt{\pi n}} \sqrt{N}. \tag{27}$$

In Table 1 the values obtained by Eq. (27) are compared to 
the exact values of $F(n, N)$ for some chosen $n$ and $N$. 

TABLE 1. The comparison of the exact values of F(n, N) with the values 
obtained by Eq. (27) for some n and N. 

 n  | N = 100                | N = 1000 
    | Eq. (27) | exact value | Eq. (27) | exact value 
 2  | 7.97885  | 7.95892     | 25.2313  | 25.2250 
 3  | 13.0294  | 13.0213     | 41.2026  | 41.2000 
 4  | 16.9257  | 16.9045     | 53.5237  | 53.5170 
 5  | 20.1851  | 20.1472     | 63.8308  | 63.8188 
 6  | 23.0329  | 22.9752     | 72.8366  | 72.8183 
 7  | 25.5892  | 25.5090     | 80.9203  | 80.8950 
 8  | 27.9260  | 27.8207     | 88.3096  | 88.2765 
 9  | 30.0901  | 29.9577     | 95.1533  | 95.1116 
 10 | 32.1142  | 31.9525     | 101.554  | 101.503 
 20 | 47.9395  | 47.3907     | 151.598  | 151.427 
 30 | 59.7437  | 58.6681     | 188.926  | 188.595 
 40 | 69.5808  | 67.8604     | 220.034  | 219.509 
 50 | 78.1927  | 75.7182     | 247.267  | 246.522 

Hypothesis 1 and the results of the numerical calculation 
furthermore suggest that the following hypothesis is true: 
Hypothesis 2: For each $n \ge 3$, the function $N \mapsto F(n, N)$ 
is increasing and strictly concave, hence, for each $0 < k < N$ 

$$F(n, k) + F(n, N - k) \le 2 F\left(n, \frac{N}{2}\right). \tag{28}$$

Function $N \mapsto F(2, N)$ is nondecreasing, but it is not 
strictly concave, because as demonstrated by Eq. (20) its 
neighboring values can be equal. The proof of these two 
hypotheses may be very difficult, but they are not essential for 
the conclusions that are drawn later in the paper. The diagram 
in Fig. 1 shows the situation for $n = 4$ and $1 \le N \le 100$. 

FIGURE 1. The values of F(4, N) for 1 ≤ N ≤ 100. 
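
The asymptotic form of Eq. (27) is cheap to evaluate. The short Python sketch below reproduces a few of the Eq. (27) entries of Table 1 and, for the two-bin case, compares the approximation with the closed form of Eq. (19). The function names `approx_f` and `exact_f2` are hypothetical and serve only this illustration.

```python
import math


def approx_f(n, N):
    """Asymptotic approximation of F(n, N) according to Eq. (27)."""
    return 2.0 * (n - 1) * math.sqrt(N / (math.pi * n))


def exact_f2(N):
    """Closed form of Eq. (19), evaluated with exact integers before division."""
    return ((N + 1) // 2) * math.comb(N, N // 2) / 2 ** (N - 1)


# Relative error of the approximation for n = 2 and N = 100.
relative_error = (approx_f(2, 100) - exact_f2(100)) / exact_f2(100)
print(approx_f(2, 100), approx_f(10, 100), approx_f(50, 1000), relative_error)
```

For $n = 2$ and $N = 100$ the relative error is about a quarter of a percent, in line with the first row of Table 1.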


Let $F_c(n, N)$ denote the expected value of the circular 
variation, which unlike the usual variation has an additional 

VOLUME 9, 2021 




term $|x_1 - x_n|$ for the absolute value of the difference between 
the first and the last bin. $F_c(n, N)$ is then defined as 

$$F_c(n, N) = E\left[|x_2 - x_1| + \cdots + |x_n - x_{n-1}| + |x_1 - x_n|\right]. \tag{29}$$

By taking into account Eq. (21), it follows from Eq. (29) that 

$$F_c(n, N) = \frac{n}{n - 1} F(n, N). \tag{30}$$

Applying Eq. (21) and adjusting the result for later use gives 

$$\begin{aligned}
F_c(n, N) &= \frac{n}{n - 1} (n - 1) E\left[|x_2 - x_1|\right] \\
&= n E\left[|x_2 - x_1|\right] \\
&= \frac{n}{2} \left(E\left[|x_2 - x_1|\right] + E\left[|x_n - x_{n-1}|\right]\right).
\end{aligned} \tag{31}$$

All possible histograms $\mathbf{x}_n$ can be split into disjoint groups, 
according to the number of realizations which fall into the 
first $n/2$ bins. Let $q_k$ be the probability that these bins contain 
exactly $k$ realizations. Because of the symmetry, $q_k = q_{N-k}$ 
for each $k$. Since other $n/2$ bins contain exactly $N - k$ 
realizations, the conditional distribution of the realizations in 
the first $n/2$ bins is again uniform. Having all this in mind 
and applying the partition theorem to Eq. (31) gives 

$$\begin{aligned}
F_c(n, N) &= \frac{n}{2} \sum_{k=0}^{N} q_k \left(E\left[|x_2 - x_1| \mid k\right] + E\left[|x_n - x_{n-1}| \mid N - k\right]\right) \\
&= \sum_{k=0}^{N} q_k \left[F_c\left(\frac{n}{2}, k\right) + F_c\left(\frac{n}{2}, N - k\right)\right].
\end{aligned} \tag{32}$$

Applying Eq. (28) and the equality $\sum_{k=0}^{N} q_k = 1$ leads to 
the following inequality that holds for each even $n > 4$: 

$$F_c(n, N) \le 2 F_c\left(\frac{n}{2}, \frac{N}{2}\right). \tag{33}$$

Here $n$ has to be greater than 4 because having $n = 4$ 
effectively leads to use of the function $F(2, N)$ on the right side of 
the inequality, and as explained earlier, this is inappropriate 
for Eq. (28). If $n = k 2^r$ where $k \ge 3$ and $r \ge 0$ are integers, 
then taking the inequality above recursively leads further to 

$$F_c(k 2^r, N) \le 2^r F_c\left(k, \frac{N}{2^r}\right) \tag{34}$$

wherefrom for all suitable $N$ and $n$ it then follows that 

$$F_c(n, N) \le \frac{n}{k} F_c\left(k, \frac{kN}{n}\right). \tag{35}$$

If $k = 2$ is taken, then the inequality is no longer necessarily 
valid because of the involvement of $F(2, N)$. However, the 
obtained form yields a better approximation of $F_c(n, N)$ as 

$$F_c(n, N) \approx \frac{n}{2} F_c\left(2, \frac{2N}{n}\right)$$

wherefrom after applying Eq. (30) it then further follows that 

$$F(n, N) \approx (n - 1) F\left(2, \frac{2N}{n}\right),$$

which in turn is an approximation stipulated in Hypothesis 1. 

4) APPROXIMATION ERROR 

Fig. 2 shows the difference between the results of Eq. (23) 
and Eq. (27), which represent the exact and approximated 
values of $F(n, N)$, respectively. It can be seen that in cases 
where $N$ is several times greater than $n$, the approximation 
error becomes relatively insignificant for practical purposes. 
The error only becomes significant when the value of $N$ is 
relatively close to the value of $n$ or below it, but it must be 
additionally stressed that this rarely occurs in practice since 
having such values of $n$ and $N$ is not too useful. The plots 
in Fig. 3 further suggest that, if required, the approximation 
error could be modelled accurately. However, for the later use 
here it is enough to conclude that having a sufficiently large 
value of $N$ renders the approximation error insignificant. 


D. NON-UNIFORM DISTRIBUTION 

The second case is when the distribution of the sample and 
consequently the distribution of the histogram are not 
uniform. In other words this is the case where Eq. (12) does not 
hold, i.e. when $p_i \ne p_j$ for at least one pair of $i$ and $j$. Applying 
to Eq. (11) all steps that have led to Eq. (17) gives 

$$\begin{aligned}
\sum_{i=1}^{n-1} E\left[|x_{i+1} - x_i|\right]
&\le \sum_{i=1}^{n-1} \sqrt{N^2 (p_{i+1} - p_i)^2 + N \left(p_{i+1} + p_i - (p_{i+1} - p_i)^2\right)} \\
&\le \sum_{i=1}^{n-1} \left(\sqrt{N^2 (p_{i+1} - p_i)^2} + \sqrt{N \left(p_{i+1} + p_i - (p_{i+1} - p_i)^2\right)}\right) \\
&= \sum_{i=1}^{n-1} \left(N |p_{i+1} - p_i| + \sqrt{N \left(p_{i+1} + p_i - (p_{i+1} - p_i)^2\right)}\right) \\
&= N \sum_{i=1}^{n-1} |p_{i+1} - p_i| + \sqrt{N} \sum_{i=1}^{n-1} \sqrt{p_{i+1} + p_i - (p_{i+1} - p_i)^2}.
\end{aligned} \tag{36}$$

If $D$ is the sample’s theoretical distribution, then the first term 
of Eq. (36) is the discrete total variation of $D$ that is given as 

$$\|D\|_V = \sum_{i=1}^{n-1} |p_{i+1} - p_i|. \tag{37}$$

The second term is a bound for expectation of the deviation 
of this given sample from its theoretical distribution. A rough 
estimate for this second term is the value $2\sqrt{n - 1}\sqrt{N}$. It is 
obtained by first removing the subtracting part and applying 
the inequality $\sqrt{u + v} \le \sqrt{u} + \sqrt{v}$ for $u, v \ge 0$, which gives 

$$\sum_{i=1}^{n-1} \sqrt{p_{i+1} + p_i - (p_{i+1} - p_i)^2} \le \sum_{i=1}^{n-1} \sqrt{p_{i+1} + p_i} \le \sum_{i=1}^{n-1} \sqrt{p_{i+1}} + \sum_{i=1}^{n-1} \sqrt{p_i}. \tag{38}$$

FIGURE 2. The difference between the results of Eq. (23) and Eq. (27), which represent the exact and approximated values of F(n, N), respectively: 
a) the absolute error, b) the relative error, and c) the dependence of certain relative errors on n and N. 


Since $\sqrt{p_i}$ and $\sqrt{p_{i+1}}$ are non-negative, the sums in Eq. (38) 
can effectively be seen as $L_1$-norms of $(n - 1)$-dimensional 
vectors. Applying the inequality $\|v\|_1 \le \sqrt{d}\, \|v\|_2$ where $d$ is 
the dimension of the vector $v$ [37] to the sums in Eq. (38) gives 

$$\sum_{i=1}^{n-1} \sqrt{p_i} + \sum_{i=1}^{n-1} \sqrt{p_{i+1}} \le \sqrt{n - 1} \left(\left(\sum_{i=1}^{n-1} p_i\right)^{1/2} + \left(\sum_{i=1}^{n-1} p_{i+1}\right)^{1/2}\right) \le \sqrt{n - 1}\,(1 + 1) = 2\sqrt{n - 1}. \tag{39}$$

It is useful to know the discrete total variation of some 
important distributions. Examples of their histograms are 
shown in Fig. 4. The uniform distribution has a zero total 
variation. For the triangular distribution $T$ with $n$ bins this is 

$$\|T\|_V = \frac{4n - 8}{n^2} \approx \frac{4}{n + 2} \tag{40}$$

for an even $n$, while in the case of an odd $n$ this is given as 

$$\|T\|_V = \frac{4n - 6}{n^2} \approx \frac{4}{n + 3/2}. \tag{41}$$

The square distribution $Q$ for which $p_i = C i^2$ with $n$ bins has 
a discrete total variation that can be approximated as 

$$\|Q\|_V \approx \frac{3}{n}. \tag{42}$$

Next, in the case of the square root distribution $S$ for which 
$p_i = C \sqrt{i}$ with $n$ bins the approximation is given as 

$$\|S\|_V \approx \frac{3}{2n}. \tag{43}$$

For the geometric distribution $G$ with parameter $p$ this is 

$$\|G\|_V = p, \tag{44}$$

for the Poisson distribution $P$ with parameter $\lambda > 1$ it is 

$$\|P\|_V \approx \frac{2 \lambda^{\lfloor \lambda \rfloor} e^{-\lambda}}{\lfloor \lambda \rfloor !}. \tag{45}$$

The discrete total variation for a unimodal discrete distribution 
with mode $M$ is bounded by $2M$. The mode for symmetric 
binomial distribution $B(n, \frac{1}{2})$ is $\frac{1}{2^n} \binom{n}{\lfloor n/2 \rfloor}$ and 

$$\|B\|_V \approx \sqrt{\frac{8}{\pi n}}. \tag{46}$$

The normal distribution $\mathcal{N}(0, \sigma^2)$ is a continuous one with 
unbounded support and its theoretical DTV depends on 
rasterization. The total variation of the probability density function 
is $\frac{2}{\sigma \sqrt{2\pi}}$. If $[-c, c]$ is essentially the support of the 
distribution and if $n > \frac{2c}{\sigma}$, then $\|\mathcal{N}\|_V$ can be approximated: 

$$\|\mathcal{N}\|_V \approx \frac{2c}{n \sigma} \sqrt{\frac{2}{\pi}}. \tag{47}$$

FIGURE 3. The relation between the error when using Eq. (27) and the values of: a) sample size N and b) number of bins n. 

Let $D$ be any distribution and $\mathbf{x}_n$ the histogram with $n$ 
bins of a corresponding sample of $N$ values drawn from the 
distribution $D$. Then similarly to Eq. (36) it can be written 

$$E\left[\|\mathbf{x}_n\|_V\right] \le \|D\|_V \cdot N + E\left[\|R\|_V\right] \sqrt{N} \tag{48}$$

where $R$ is a deviation from the theoretical distribution. If 
there was no randomness and all values were distributed 
exactly as predicted by the probabilities, then $E[\|\mathbf{x}_n\|_V]$ 
would be $\|D\|_V \cdot N$. Therefore, the second term is due to 
the randomness. A further thing to notice here is that as $N$ 
grows, randomness plays an ever smaller role in Eq. (48) and 
as $N$ tends to infinity, the first term, which is linear in $N$, gets to fully dominate 
in Eq. (48), which is also expected in accordance with the 
Glivenko-Cantelli theorem. In Fig. 5 the total variation of 
the theoretical distribution and the total variation of a sample 
are equal. This will be the case for all samples which do not 
alter order between adjacent bins. Therefore, the alteration 
from the theoretical distribution means that the corresponding 
sample is essentially different from the theoretical one. Deviation 
from the theoretical distribution can be approximated 
as total variation of a sample from uniform distribution and 
therefore the bounds written before can be applied to any 
distribution. 

FIGURE 4. Histograms for the a) triangular, b) quadratic, c) square root, d) geometric, e) Poisson, and f) binomial distribution. The shown histograms 
are merely for the sake of illustration and the x-axes do not strictly follow the equations in Section III-D. 

FIGURE 5. From a) the theoretical distribution by adding b) the deviation due to randomness to c) the final sample distribution. 
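
The bound of Eq. (36) can be evaluated for any vector of bin probabilities and compared with simulated histograms. The Python sketch below does this under assumed values: the triangular-shaped probabilities `p`, the sample size `N`, the number of simulated histograms, and the function name `dtv_bound` are hypothetical and chosen only for illustration.

```python
import numpy as np


def dtv_bound(p, N):
    """Last form of Eq. (36): N * ||D||_V plus the sqrt(N) randomness term."""
    p = np.asarray(p, dtype=float)
    dp = np.diff(p)
    theoretical_dtv = np.abs(dp).sum()                 # ||D||_V of Eq. (37)
    spread = np.sqrt(p[1:] + p[:-1] - dp ** 2).sum()   # second sum of Eq. (36)
    return N * theoretical_dtv + np.sqrt(N) * spread


rng = np.random.default_rng(2)
p = np.array([1, 2, 3, 4, 3, 2, 1], dtype=float)
p /= p.sum()   # assumed non-uniform distribution with ||D||_V = 0.375
N = 2000       # assumed sample size

hists = rng.multinomial(N, p, size=20_000)
mean_dtv = np.abs(np.diff(hists, axis=1)).sum(axis=1).mean()
bound = dtv_bound(p, N)
rough_bound = N * 0.375 + 2.0 * np.sqrt((len(p) - 1) * N)  # uses Eq. (39)
print(mean_dtv, bound, rough_bound)
```

For this assumed distribution the differences between neighboring bins are large compared with their random fluctuations, so the simulated mean stays close to $\|D\|_V \cdot N = 750$, below both the bound of Eq. (36) and the rougher bound based on Eq. (39).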

With regard to the distribution, the use of the discrete total 
variation that is somewhat similar to the $L_1$-norm may be 
reminiscent of the assumption of the Laplace distribution. 
However, no minimization, regularization, or any similar 
process that requires such an assumption is being performed 
here. Therefore, it should be stressed again that the relations 
obtained here can be applied to samples of any distribution. 


E. THE PROPOSED MODEL 
After taking into account the previous subsections’ results, it 
is reasonable to consider the model for $E[\|\mathbf{x}_n\|_V]$ to be 

$$m = aN + b\sqrt{N}. \tag{49}$$

This model can be fitted directly to the sizes and discrete 
total variations obtained on the given histograms that are to be 
checked for outliers. If there are not enough given histograms 
to cover the desired value ranges of $N$, then additional ones 
can be created by randomly subsampling the given ones. If 
a larger number of histogram outliers is suspected, 
their detrimental influence on the fitting of Eq. (49) 
can be reduced by applying methods such as RANSAC [38]. 

Alternatively, if the distribution, i.e. the values of $p_i$ for the 
histograms’ bins, is known, then $a$ and $b$ can be obtained 
through Monte Carlo simulation by randomly creating 
arbitrarily many histograms of various sizes $N$ and then fitting 
the model Eq. (49) to their sizes and discrete total variations. 
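
Since Eq. (49) is linear in the parameters $a$ and $b$, one possible way to fit it is ordinary least squares on the regressors $N$ and $\sqrt{N}$. The Python sketch below shows this under the assumption that plain least squares is acceptable; it is not necessarily the fitting procedure used in the TVOR source code, it does not include RANSAC, and the function name `fit_dtv_model` as well as the synthetic data are hypothetical.

```python
import numpy as np


def fit_dtv_model(sizes, dtvs):
    """Least-squares fit of m = a * N + b * sqrt(N), i.e. Eq. (49)."""
    sizes = np.asarray(sizes, dtype=float)
    design = np.column_stack([sizes, np.sqrt(sizes)])
    (a, b), *_ = np.linalg.lstsq(design, np.asarray(dtvs, dtype=float), rcond=None)
    return a, b


# Noise-free synthetic data generated from assumed parameters a = 0.3, b = 2.
sizes = np.array([100.0, 400.0, 900.0, 1600.0, 2500.0])
dtvs = 0.3 * sizes + 2.0 * np.sqrt(sizes)
a, b = fit_dtv_model(sizes, dtvs)
print(a, b)
```

The model has no intercept, so the design matrix contains only the two columns $N$ and $\sqrt{N}$.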


F. SCORE CALCULATION 

Once the model described by Eq. (49) has been fitted to data, 
the next step is to assign an outlier score to each of the given 
histograms. The first step is to calculate a histogram’s discrete 
total variation. Next, the discrete total variation expected for 
the histogram’s size is obtained by using Eq. (49). Finally, 
the absolute difference between these two values is 

$$d = \left|\|\mathbf{x}_n\|_V - m\right|. \tag{50}$$

However, $d$ cannot yet be used as the score because the 
standard deviation of the discrete total variation for histograms 
of random samples varies depending on the sample size 
$N$, which means that the significance of $d$ depends on $N$. 
This means that first the influence of the sample size on 
the standard deviation has to be removed. Additionally, the 
discrete total variation is already a statistic of the sample, 
which means that its standard deviation is actually the 
standard error [39]. Many standard errors that do not include 
division by $N$ are proportional or close to being proportional 
to $\sqrt{N}$, at least in the limit, and in practice this is also the case 
with the discrete total variation. This can intuitively be seen 
in the form of the second term of Eq. (48) as discussed earlier. 
Therefore, for practical purposes the influence of $N$ on $d$ can 
be approximately removed by calculating the distance $d'$ as 

$$d' = \frac{d}{\sqrt{N}} = \frac{\left|\|\mathbf{x}_n\|_V - m\right|}{\sqrt{N}} = \frac{\left|\|\mathbf{x}_n\|_V - aN - b\sqrt{N}\right|}{\sqrt{N}}. \tag{51}$$

The value of $d'$ can now be used instead of the value $d$ as 
the outlier score for the histogram that it was calculated for 
because it is normalized with respect to the standard error. 

It must be mentioned that strictly speaking Eq. (51) is 
theoretically not correct because the expected value of the 
discrete total variation is not always proportional to $N$. 
However, during the research conducted for this paper it has been 
empirically shown that for all tested distributions the standard 
error was proportional to $\sqrt{N}$ and that using Eq. (51) is a 
good practice, even though it may introduce inaccuracies. 
Since Eq. (51) was specifically designed to comply with the 
statistical properties related to the discrete total variation as 
discussed here, using some other score calculation would 
potentially require a major overhaul of the whole framework. 

An alternative to using Eq. (51) that unlike Eq. (51) does 
not include an explicitly derived formula is to take all data 
from the given histograms, use it in Monte Carlo simulations 
to create samples of various desired sizes, for each of these 
sizes calculate the discrete total variations and their standard 
deviation, and fit a model to these sizes and their respective 
standard deviations. If enough data is available, this should 
result in a relation that is very similar to the one in Eq. (51). 

Since $d'$ is the normalized distance from the expected 
discrete total variation and since it resembles the t-statistic, 
it could be further used to also provide a probabilistic 
interpretation for a given histogram. However, the goal of this 
paper is not to propose a new statistical test that can be used 
in hypothesis testing with predetermined significance levels. 
The main goal of this paper is just to find the most likely 
outlier candidates based on the discrete total variation and the 
distance $d'$ also suffices for such ranking. Therefore, 
the calculation of a probabilistic interpretation is omitted in this paper. 

G. APPLICATION 

With all the required background given in the previous sub- 
sections, it is possible to propose a new method for detecting 
histogram outliers in terms of the discrete total variation. 

Algorithm 1 The Proposed Method TVOR 

Input: $M$ input histograms $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(M)}$ 
Output: scores for input histograms $d'_1, d'_2, \ldots, d'_M$ 

for $i \in \{1, 2, \ldots, M\}$ do 
    $s_i = \sum_{j=1}^{n} x_j^{(i)}$    ▷ Calculate sample size 
    $v_i = \sum_{j=2}^{n} |x_j^{(i)} - x_{j-1}^{(i)}|$    ▷ Calculate discrete total variation 
end for 
$a, b = \mathrm{FitModel}\left(\bigcup_{i=1}^{M} \{(s_i, v_i)\}\right)$    ▷ Fit Eq. (49) to data 
for $i \in \{1, 2, \ldots, M\}$ do 
    $d'_i = \frac{|v_i - a s_i - b \sqrt{s_i}|}{\sqrt{s_i}}$    ▷ Calculate the score 
end for 


First, multiple histograms for the samples of various sizes 
are given as input. The histograms are supposed to have 
the same bins where each of the bins can have an arbitrary 
interval. It is also supposed that all these samples are drawn 
from the same distribution and the goal is to check which of 
them are most likely to be outliers in terms of the discrete 
total variation. Next, the discrete total variation is calculated 
for each of these histograms. Then, model Eq. (49) is fitted to 
histogram sizes and discrete total variations. Finally, each of 
the histograms is scored by applying Eq. (51). The histograms 
for which the highest score values were obtained are the 
most likely outlier candidates in terms of their discrete total 
variation. All these steps are summarized in Algorithm 1. 
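
The steps above can be sketched in a few lines of Python. This is a hedged illustration rather than the reference implementation: it assumes that the discrete total variation is the sum of absolute differences of adjacent bins, that Eq. (49) has the form $v \approx a s + b \sqrt{s}$ and is fitted by ordinary least squares, and that Eq. (51) normalizes the absolute deviation by $\sqrt{s}$. The function name `tvor_scores` and the synthetic data are hypothetical.

```python
import numpy as np

def tvor_scores(histograms):
    """Score histograms (rows) with the same bins; higher is more suspicious."""
    x = np.asarray(histograms, dtype=float)
    s = x.sum(axis=1)                              # sample sizes
    v = np.abs(np.diff(x, axis=1)).sum(axis=1)     # assumed DTV definition
    # Assumed form of Eq. (49): v ~ a*s + b*sqrt(s), fitted by least squares.
    design = np.column_stack([s, np.sqrt(s)])
    (a, b), *_ = np.linalg.lstsq(design, v, rcond=None)
    # Assumed form of Eq. (51): deviation from the model divided by sqrt(s).
    return np.abs(v - (a * s + b * np.sqrt(s))) / np.sqrt(s)

rng = np.random.default_rng(0)
p = np.full(20, 1 / 20)                            # hypothetical uniform bins
sizes = rng.integers(500, 1001, size=100)
hists = np.array([rng.multinomial(n, p) for n in sizes])

# Make histogram 7 jagged without changing its sample size:
# move half of every odd-indexed bin to its left neighbor.
moved = hists[7, 1::2] // 2
hists[7, 1::2] -= moved
hists[7, ::2] += moved

scores = tvor_scores(hists)
top = int(np.argmax(scores))
```

In this toy setup the jagged histogram keeps its sample size but has a much larger discrete total variation than the model predicts, so it is expected to receive the highest score.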

Here it should be additionally stressed that the proposed
method has no hyperparameters whatsoever that would have
to be tuned or that would otherwise influence the result. It
may seem that the number of histogram bins n is a tunable
hyperparameter, but the proposed method is agnostic
to the underlying histogram samples: it merely receives
already existing histograms as inputs. The histograms are
only assumed to have the same bins. Neither the range of
the bins nor whether they are bounded is important.


H. THE PROPOSED METHOD’S NAME

Due to the reliance of the proposed method’s model on the discrete
total variation, it was named Total Variation Outlier Recognizer
(TVOR) or, for the sake of simplicity, just Tvor, which
is pronounced /tʋôːr/ and means skunk in Croatian.


IV. EXPERIMENTAL RESULTS 

In order to validate the proposed method, several experi- 
ments have been conducted on both synthetic and real-life 
data. Additionally, it is shown why the proposed method is 
more appropriate than some other similar methods. To give a 
clear and descriptive overview of the method’s properties, the 
structure of this section is purposely slightly more extended. 
First, Section IV-A describes a baseline method for histogram
outlier detection based on Pearson’s chi-squared test [10]
to compare its results to the ones of the proposed method.
In Section IV-B the behavior of the proposed method in
several scenarios of changing conditions is demonstrated and
additionally explained by several experiments for distribution 
outlier detection among histograms of random samples of 
different sizes drawn from the normal distribution and the 
beta distribution with various parameter values. Similar to 
that, Section IV-C contains experiments for discrete total vari- 
ation outlier detection among histograms of random samples 
of various sizes drawn from the beta distribution. The real- 
life practical use of the proposed method is demonstrated 
in Section IV-D on the histograms of the birth years taken 
from census data of several populations from the same time 
frame. Section IV-E shows the advantage of the proposed 
method over some other methods that can be used for similar 
purposes. The obtained results are discussed in Section IV-F. 
The online repository with the source code and the data 
required to recreate the results is described in Section IV-G. 


A. THE BASELINE METHOD 
The proposed method’s goal is to detect outliers specifically
in terms of the expected discrete total variation, which can
differ significantly from detecting distribution outliers in
general. Therefore, the goal of this section is to show the
difference in the performance of the proposed method and
Pearson’s chi-squared test [10]. This test can be used
to check whether a histogram is an outlier by comparing 
the values of the histogram’s bins, which serve here as the 
categorical variables, to the values that are expected under 
a supposed distribution. However, since in the problem that 
is being analyzed in this paper the supposed distribution is 
unknown, the expected bin values first have to be estimated. 
The first step in calculating the i-th expected bin value is to
sum the values of the i-th bin in all given histograms except
the tested one. When this is done for all n bins, all of the
obtained bin sums are divided by the sum of values of all
bins in all histograms except the tested one. These normalized
sums now represent the estimates of the probabilities that
a value will fall in each of the histogram bins. The more
histograms are given, the better these estimates are under the
Glivenko-Cantelli theorem. Next, all these estimated probabilities
are multiplied by the sum of all bin values in the tested
histogram. In that way the sum of the bins in the tested
histogram and the sum of the estimated expected bin values
are the same. Then, a small positive number is added to all
scaled bin values in order to avoid division by zero during
the calculation of Pearson’s chi-squared test statistic. Finally,
the obtained Pearson’s chi-squared test statistic is used as
the outlier score for the tested histogram. The described
procedure is summarized in Algorithm 2.
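
The leave-one-out procedure described above can be sketched as follows. This is a minimal, hedged illustration and not the repository code: the function name `baseline_scores` and the example data are hypothetical, and the value of the small positive constant `eps` is an assumption.

```python
import numpy as np

def baseline_scores(histograms, eps=1e-9):
    """Leave-one-out Pearson chi-squared statistic for each histogram (row).

    eps is an assumed value for the small positive number.
    """
    x = np.asarray(histograms, dtype=float)
    s = x.sum(axis=1, keepdims=True)      # sample size of each histogram
    bin_sums = x.sum(axis=0)              # per-bin sums over all histograms
    total = x.sum()                       # sum of all bins in all histograms
    # Bin probabilities estimated from all histograms except the tested one.
    probs = (bin_sums - x) / (total - s)
    expected = probs * s + eps            # scaled to the tested sample size
    return ((x - expected) ** 2 / expected).sum(axis=1)

rng = np.random.default_rng(1)
p_in = np.array([0.1, 0.2, 0.4, 0.2, 0.1])    # hypothetical inlier bins
p_out = np.array([0.4, 0.2, 0.1, 0.2, 0.1])   # hypothetical outlier bins
hists = np.array([rng.multinomial(800, p_in) for _ in range(50)])
hists[3] = rng.multinomial(800, p_out)
scores = baseline_scores(hists)
top = int(np.argmax(scores))
```

Because the expected values are scaled by the tested histogram's own sample size, histograms of different sizes can be scored together.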


Algorithm 2 The Baseline Method

Input: M input histograms $x^{(1)}, x^{(2)}, \ldots, x^{(M)}$
Output: scores for input histograms $\chi^2_1, \chi^2_2, \ldots, \chi^2_M$

1: for $i \in \{1, 2, \ldots, M\}$ do
2:     $s_i = \sum_{j=1}^{n} x^{(i)}_j$    ▷ Calculate sample size
3: end for
4: $S = \sum_{i=1}^{M} s_i$    ▷ Calculate the sum of all bins
5: for $j \in \{1, 2, \ldots, n\}$ do
6:     $b_j = \sum_{i=1}^{M} x^{(i)}_j$    ▷ Calculate individual bin sum
7: end for
8: $\varepsilon > 0$    ▷ A small positive number
9: for $i \in \{1, 2, \ldots, M\}$ do
10:     for $j \in \{1, 2, \ldots, n\}$ do
11:         $O^{(i)}_j = x^{(i)}_j$    ▷ The observed bin value
12:         $E^{(i)}_j = \frac{b_j - x^{(i)}_j}{S - s_i} s_i + \varepsilon$    ▷ The expected bin value
13:     end for
14:     $\chi^2_i = \sum_{j=1}^{n} \frac{\left( O^{(i)}_j - E^{(i)}_j \right)^2}{E^{(i)}_j}$    ▷ Calculate the score
15: end for


B. SYNTHETIC DATA FOR DISTRIBUTION OUTLIERS

1) THE GOAL

Since there is much freedom in the overall data generation
procedure when using synthetic data and fewer or no limitations
when compared to using real-life data, the goal of this subsection
is to demonstrate and explain in more detail the behavior
of the proposed method depending on gradual changes of
various conditions. The performance is here first measured
in terms of distribution outlier detection, even though the
proposed method was not designed specifically for that task,
while the performance in terms of DTV outlier detection is
described in the following subsection. The experiments were
performed for cases when the inlier and outlier samples for
histograms were from the same distribution with changed
parameter values and from different distributions.


2) EXPERIMENTAL SETUP 

The experiments for distribution outlier detection on syn- 
thetic data, i.e. histograms of random samples, were con- 
ducted by repeatedly first simulating the mixtures of inlier 
and outlier samples, then trying to recognize the outlier 
samples by means of applying the baseline method and 
the proposed method, and finally examining the results of 
these simulations. The experiments were conducted for two
general cases of inlier and outlier random sample distributions
by mixing them in $10^4$ simulations. In the first case
both the inlier and outlier samples were from the normal
distribution.

In each simulation of this first case, the inlier data was 
prepared by generating 100 random inlier samples drawn 
from the normal distribution with mean 0 and variance 1, i.e.
$\mathcal{N}(0, 1)$. The size of each individual sample was randomly
chosen to be between 500 and 1000. The histogram bins were
set to be evenly spaced on the interval $[-c, c]$ where c is an
arbitrarily chosen value used to check the behavior of various
bin arrangements. Each sample value falling outside of the
interval $[-c, c]$ was replaced with the closer one of c and −c.
Several values of c, as well as several values of the number of
bins n, were used to check the effect of changing conditions.

FIGURE 6. The probability density functions of the beta distribution and
triangular distribution used in the experiments.
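
The construction of the inlier histograms for the normal case, as described above, can be sketched as follows. The chosen values of c and n, the inclusive size bounds, and the function name `clipped_histogram` are assumptions made only for illustration.

```python
import numpy as np

def clipped_histogram(sample, c, n_bins):
    """Histogram on [-c, c] with n_bins equal bins; outside values are clipped."""
    clipped = np.clip(sample, -c, c)      # replace with the closer of -c and c
    counts, _ = np.histogram(clipped, bins=n_bins, range=(-c, c))
    return counts

rng = np.random.default_rng(0)
c, n_bins = 5.0, 20                        # illustrative values of c and n
sizes = rng.integers(500, 1001, size=100)  # assumed inclusive bounds 500..1000
inliers = np.array([
    clipped_histogram(rng.normal(0.0, 1.0, size=s), c, n_bins)
    for s in sizes
])
```

Clipping before binning means that every sample value is counted, so each histogram's bin sum equals its sample size.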


Furthermore, in each simulation, the outlier data was generated
by drawing a certain number of random samples from
$\mathcal{N}(0, \sigma^2)$ for various $\sigma \neq 1$. The sample size was randomly
determined in the same way as for the inlier samples. For both
the inlier and outlier data the values of c and n were set to 
the same values to assure having histograms with the same 
bins. Next, the baseline method and the proposed method 
were applied to the combined inlier and outlier data to score 
individual histograms. Finally, the mean value of the rank of 
all outlier examples obtained by each method was calculated 
as the performance score of each method. A lower mean rank 
here means a better performance in terms of outlier detection. 
For the sake of simplicity, zero-based numbering was used for 
ranks. This means that in the case of a single added outlier 
sample, the optimal mean rank of a tested method is 0, while 
in the case of e.g. 10 added outlier samples, the optimal mean 
rank is 4.5 since this is the average value of the first 10 zero- 
based ranks, which should all be assigned to outlier samples’ 
histograms in the case of a method that performs ideally. 
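
The mean-rank performance measure described above can be illustrated with a short sketch. The function name `mean_outlier_rank` and the way ties are broken (stable sort) are assumptions; the example reproduces the optimal values of 0 for a single outlier and 4.5 for 10 outliers.

```python
import numpy as np

def mean_outlier_rank(scores, outlier_mask):
    """Mean zero-based rank of the outliers when sorted by descending score."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # highest score gets rank 0
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(len(scores))
    return ranks[np.asarray(outlier_mask, dtype=bool)].mean()

# Ideal case: 100 inliers and 10 outliers that all score above every inlier.
scores = np.concatenate([np.linspace(0, 1, 100), np.full(10, 2.0)])
mask = np.concatenate([np.zeros(100, dtype=bool), np.ones(10, dtype=bool)])
ideal = mean_outlier_rank(scores, mask)          # 4.5

# Single outlier with the highest score: mean rank 0.
single = mean_outlier_rank([0.1, 0.9, 0.3], [False, True, False])
```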

In short, every instance of the simulation setup is uniquely 
determined by the number of histogram bins n, the number 
of added outlier samples, the value c used to determine the 
interval of the binned values, and the value of $\sigma$ for the outlier
distribution. Simulations for each instance were repeated
$10^4$ times to check the performance of the baseline method
and the proposed method in various sampling conditions. 

In the second general case, the inlier samples were drawn
from the beta distribution with parameter values $\alpha = 7$
and $\beta = 1$, while the outlier samples were drawn from the
triangular distribution with parameter values a = 0, b = 1,
and c = 0.5. The probability density functions of these
distributions are shown in Fig. 6. Similarly to the previous
case, several combinations of the number of bins n and the
number of outlier sample histograms added to the 100 inlier
samples were checked. For each combination, the results of
the methods’ performance were averaged over $10^4$ simulations.


3) RESULTS 
After examining the results of performing simulations for a 
large number of setups when both the inliers and the outliers
are from the normal distribution, due to the similarity of many
of the results, it was decided to show only those that can 
be used to summarize them all. These results are shown in 
Fig. 7. The first thing to observe is that in the majority of 
the cases the baseline method based on the Pearson’s chi- 
squared test performs better in terms of outlier ranking. This 
is mainly because the proposed method was not designed to 
find outliers in general, but to find outliers in terms of the 
discrete total variation. Interestingly, however, the exceptions
to this are the cases when there is a relatively small number
of bins, which can be seen in Figs. 7a and 7f, and cases with a
high number of added outlier sample histograms, which can
be seen in Fig. 7e where the proposed method outperforms
the baseline method for all given numbers of bins. This means
that even though the proposed method was not designed for the
same task as the baseline method, in some cases it is still
able to outperform it, which may be useful should such cases
emerge. A more detailed analysis of the performance results
shown in Fig. 7 is given in the Appendix, which also explains the
sudden drops in the performance such as the one in Fig. 7d.

In short, the proposed method generally performs worse 
than the baseline method. However, in the cases of smaller 
values of n, i.e. in the cases of a smaller number of bins, 
as well as in the cases with a high number of outliers, it
may perform better. Similar results can be obtained with 
some other distributions as well and therefore they have been 
omitted here. If required, any other experiments with a similar 
setup can be conducted by using the source code publicly 
available in the repository that is described later. 

Next, Fig. 8 shows the results of the experiment where the 
inlier and the outlier samples were drawn from the beta and 
the triangular distribution, respectively. As can be expected 
by viewing Fig. 6, the baseline method outperforms the pro- 
posed method in most cases since the difference between the 
used distributions is significant. Nevertheless, Fig. 8b again
shows that the proposed method may be able to outperform
the baseline method in the case of a high number of outliers.

The performance drop of the proposed method for several 
values of n shown in Fig. 8a deserves some additional com- 
ments. As shown in Fig. 9a, the theoretical DTVs of both 
distributions are clearly separated for all shown values of 
n. This means that if the random samples were sufficiently 
big, then the performance should significantly improve in 
accordance with Eq. (48). Namely, in that case the influence 
of the sample size significantly overpowers the influence of 
the randomness. As a matter of fact, if the whole experiment is 
repeated with random samples having their sizes increased by 
several orders of magnitude, then both the proposed method 
and the baseline method have the same ideal performance. 
However, as mentioned earlier, the size of each sample used 
in the experiment whose results are shown in Fig. 8a was 
randomly chosen to be between 500 and 1000. For such 
sizes, the randomness still has a substantial influence on the 
histograms’ DTVs. This is illustrated in Fig. 9b, which shows
the mean DTV calculated for $10^6$ random samples of size
1000 for various values of n created for both the beta and the
triangular distribution. It can be clearly seen how this differs
from the case of the theoretical DTVs and this can be used
to explain the particularly low performance of the proposed
method when n is 30 and 35 shown in Fig. 8a. Namely, for
these values of n, the mean values of DTVs become so close
that, with the influence of randomness included, it becomes
difficult to successfully distinguish between the inlier and the
outlier histograms based only on their DTVs. The dependence
of the proposed method’s performance on the size of the
samples is further analyzed in more detail in the Appendix. Based
on all the results shown here and in the Appendix, it can be
concluded that the proposed method’s performance improves
as the size of the samples increases.

FIGURE 7. Comparing the performance of the proposed and baseline methods. First row: performance with 1 added outlier and c = 5 for
a) $\sigma = 0.9$ and b) $\sigma = 1.5$. Second row: performance with 1 added outlier and $\sigma = 0.5$ for c) c = 5 and d) c = 10. Third row: performance with
90 added outliers and c = 5 for e) $\sigma = 0.9$ and f) $\sigma = 1.5$. The results for TVOR + RANSAC were added only in the third row because for the
results in the first and the second row the difference was not that significant.

Overall, in terms of distribution outlier detection, the
performance of the proposed method is only indirectly dependent
on the inlier and the outlier distributions. As shown, it is
directly dependent on the difference between the theoretical
DTVs of these distributions, which is in turn dependent on
the chosen histogram bins. This means that, depending on
the histogram bins, the proposed method may perform well
even when the inlier and the outlier distribution are the same,
but with slightly different parameters. On the other hand, for
significantly different inlier and outlier distributions that have
similar theoretical DTVs for the chosen bins, the proposed
method may perform poorly. The opposite cases are also
possible. Nevertheless, this is not too problematic because
the proposed method was not designed for distribution
outlier detection, but specifically for the DTV outlier
detection.

FIGURE 8. Comparing the proposed and the baseline method in terms of distribution outlier detection performance where 100 inlier random
samples are drawn from the beta distribution with $\alpha = 7$ and $\beta = 1$, while the triangular distribution with a = 0, b = 1, and c = 0.5 is used to
draw the added a) 1 outlier sample histogram and b) 90 added outlier histograms.

FIGURE 9. The comparison of the used beta and triangular distributions in terms of a) the theoretical discrete total variation $\lVert D \rVert_1$ described
in Eq. (37) and b) the mean DTV calculated for $10^6$ random samples of size 1000 for various values of n.

FIGURE 10. The histogram of a random sample drawn from the beta distribution with $\alpha = 2$ and $\beta = 3$ in the case of a) no heaping and
b) heaping by moving 10% of randomly chosen items to bins with ordinal numbers divisible by 5 closest to them.

FIGURE 11. Comparing the performance of the proposed and baseline methods averaged over $10^4$ random trials in cases where the number of
outlier random samples added to the original 100 inlier random samples was a) 1, b) 10, c) 30, d) 50, e) 70, and f) 90. The inlier and
outlier random samples were drawn from the beta distribution with $\alpha = 2$ and $\beta = 3$, but the outlier samples were additionally changed in
order to make their histograms have a prespecified amount of heaped bin values.


C. SYNTHETIC DATA FOR TOTAL VARIATION OUTLIERS 

1) THE GOAL 

The goal of this subsection is to demonstrate the behavior
of the proposed method for the case that it was originally
designed for, i.e. for discrete total variation outlier detection.
Additional emphasis is specifically put on cases where the
number of outliers gets very close to the number of inliers.

FIGURE 12. The experiments on the German census of 1939: a) histogram of birth years of the German census of 1939 based on the data from [40],
[41], starting from year 1850, b) fitting the proposed method's model in Eq. (49) to data for subsamples of the German census of 1939, c) the
relation between the sample size and the standard deviation of the discrete total variation obtained through Monte Carlo simulations for the
subsamples of the German census of 1939 and a fitted function $y = a\sqrt{N}$, and d) the distribution of discrete total variations obtained for 100k
subsamples of the German census of 1939 of size 10k.


2) EXPERIMENTAL SETUP 

Since earlier in the paper it was mentioned that demographics
is one of the fields that can benefit from discrete total variation
outlier detection, the beta distribution with $\alpha = 2$ and
$\beta = 3$ was chosen for the inlier samples’ distribution. The
reason is the resemblance of its histograms to the histograms
of some population age distributions. For all experiments the
number of bins n was fixed to 100. The outlier samples were
initially also drawn from the same beta distribution and their
histograms also had 100 bins. However, the outlier samples’
histograms underwent an additional change to simulate the
so-called age heaping [16]. Namely, for a certain number of
randomly chosen bins with a count greater than 0, their count
was decreased by 1 and the count of the closest bin to each of
them whose ordinal number was divisible by 5 was increased
by 1, as can be seen in the example shown in Fig. 10.
This was done for various combinations of the number of
outlier samples and the number of randomly chosen bins that
were changed for these outlier samples’ histograms. Finally,
the performances of the proposed method and of the baseline
method were compared for all these combinations.
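
The heaping simulation described above can be sketched as follows. This is a hedged illustration only: it assumes 1-based ordinal numbers for the bins, picks the moved items uniformly at random without replacement (in the spirit of the caption of Fig. 10), and uses a hypothetical function name `heap`.

```python
import numpy as np

def heap(hist, fraction, rng):
    """Move a fraction of items to the nearest bin whose ordinal is divisible by 5."""
    hist = np.asarray(hist).copy()
    n = len(hist)
    ordinals = np.arange(1, n + 1)               # assumed 1-based ordinals
    # Nearest multiple of 5 for every ordinal, kept inside [5, largest multiple].
    nearest = np.clip((ordinals + 2) // 5 * 5, 5, n // 5 * 5)
    targets = nearest - 1                        # back to 0-based indices
    n_moves = int(round(fraction * hist.sum()))
    # One entry per item, holding the index of the bin the item is in.
    item_bins = np.repeat(np.arange(n), hist)
    moved = rng.choice(item_bins, size=n_moves, replace=False)
    np.subtract.at(hist, moved, 1)               # remove items from their bins
    np.add.at(hist, targets[moved], 1)           # add them to the heaped bins
    return hist

rng = np.random.default_rng(0)
sample = rng.beta(2, 3, size=1000)
original, _ = np.histogram(sample, bins=100, range=(0, 1))
heaped = heap(original, 0.10, rng)
```

The operation preserves the sample size, so any change in the outlier score comes from the changed shape of the histogram and not from a changed number of items.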


3) RESULTS 

The obtained results and comparisons are shown in Fig. 11. 
It can be seen that if there are only a few outliers, then the 
proposed and the baseline methods are on par with each other 
and there are only some smaller differences in performance 
for various amounts of heaped values. However, as the number
of outliers increases, the proposed method starts to signifi- 
cantly outperform the baseline method, especially in cases 
where RANSAC is used as suggested in Section III-E. This is 
especially noticeable in Fig. 11f where the number of outliers 
is very close to the number of inliers. There the baseline 
effectively degrades to a random chooser, while the proposed 
method used in combination with RANSAC excels at outlier 
detection. This shows the usefulness of the proposed method
for the task of finding the discrete total variation outliers.


D. CENSUS DATA 

1) THE GOAL 

The goal of this subsection is to test the proposed method
on an example of real-life census data with sample sizes
spanning several orders of magnitude and being drawn from
slightly different, but similar distributions. Here a closer look
is taken at the samples of the top-scoring histograms. This
can show the robustness of the proposed method in noisy
conditions and its usefulness for real-life data applications.


2) EXPERIMENTAL SETUP 

Several census data sources have been used for the experi- 
mental setup. The largest of them is the German census of 
1939 [40] with the corresponding birth year histogram being 
shown in Fig. 12a. Since the significant gap for the years of 
World War I can be traced in age composition of other similar 
lists and censuses of other countries as well [46], [47], this 
census data is used here as a gold standard for the discrete 
total variation of the population histograms for that time. 

In addition to that, 7106 variously sized censuses, i.e. lists 
of people with birth years available at the website of the 
United States Holocaust Memorial Museum (USHMM) [48] 
are used since they were composed for the populations from 
roughly the same time frame. The distribution of the majority 
of the sizes with the largest ones being excluded for practical 
purposes is shown in Fig. 13. The geographical locations of 
these populations differ, but they still mostly cover the popu- 
lations whose birth year histograms should have similar dis- 
crete total variation properties. To make it clear immediately, 
this does not necessarily mean that the age distributions are 
similar as well. Namely, one census can have a significantly
higher number of, e.g., young people in comparison to other
censuses, but, as will be shown later on concrete examples,
this should not necessarily affect the discrete total variation
of the birth year histograms too significantly. Therefore, these
lists available at USHMM constitute an interesting dataset in 
which to look for outliers in terms of discrete total variation. 


3) RESULTS 

The first experiments that were carried out consisted of sim- 
ply taking many variously sized subsamples of the birth years 
from the German census of 1939, calculating the discrete 
total variations of their birth year histograms, and fitting the 
proposed method’s model in Eq. (49) to the data obtained 
in this way. Fig. 12b shows the result of this experiment. 
The proposed model fits well to all data. This also holds for 
smaller subsamples where the influences of the two terms 
in Eq. (49) are still on par. It can also be seen how the discrete 
total variations get more dispersed as the sample size grows. 
While this may hint at heteroscedasticity, applying weighted 
regression or variance-stabilizing data transformations did 
not significantly change the results that are described 
here. 

The relation between the sample size and the standard 
deviation of the discrete total variation is shown in Fig. 12c. 
Very similar results are obtained for other distributions as 
well. It can be seen that the relation is very similar to 
the square root function, which effectively justifies the use 
of Eq. (51) for practical purposes. The distribution of the
discrete total variations for the subsamples of the same size
closely resembles the normal one as shown in Fig. 12d
with the remark that the discrete total variations there are
integers.

FIGURE 13. The distribution of the sizes of the majority of the 7106
USHMM lists that are used for the experiments.

After conducting the relatively simple mentioned experi- 
ments in order to get a better insight into the inner workings 
of the proposed method, the next step was to apply the method 
to all USHMM lists whose data includes birth years. The 
distribution of the values of d’ described in Eq. (51) and 
obtained by the proposed method in this way is shown in 
Fig. 15, while the relation between the calculated discrete
total variations and the predicted ones is shown in Fig. 14.

It can be seen that the majority of the values d’ in Fig. 15 are 
not spread too widely with the exception of several outliers. 
Before analyzing these outliers in more detail and commenting
on Fig. 14, it must be stressed that in Fig. 14 the plot
axes use the logarithmic scale to better accommodate the
presentation to the lists’ size distribution. Therefore, the apparent
misfit for the smallest lists can give the false impression that the
proposed model failed to fit properly, while it is actually only
a misfit on a small scale. For similar reasons many of the
differences between the calculated and the predicted discrete 
total variations for the larger lists are higher than they may 
appear to be on the plot. In addition to showing the proposed 
method’s model, the Monte Carlo model based on the average 
discrete total variations of the variously sized subsamples of 
the German census of 1939 is shown in Fig. 14 for comparison.
It can be seen that in several places its predictions are
not quite aligned with the ones of the proposed model, which
can be attributed to the distribution shown in Fig. 13, i.e. to 
the significant influence of samples of certain sizes during 
the model fitting. This can be alleviated by using techniques 
such as taking only samples of evenly spaced sizes, but as 
shown later in this subsection, the top results for the two 
models do not differ significantly even without applying such 
techniques. Therefore, the application of such techniques was 
omitted. 

Out of the 7106 lists that were analyzed, the top three
outlier lists in terms of d’ were the Jasenovac camp inmates
list [42] with d’ ≈ 43.13, the list of the Soviet Extraordinary
Commission [43] with d’ ≈ 36.5, and the list for the Franz
Street Number 38 [44] with d’ ≈ 31.29. The histograms for
these lists are shown in Figs. 16a, 16b, and 16c, respectively.
A more detailed analysis of the top-scoring histogram that
provides additional insights and explanations of the behavior
of the proposed method’s scoring is available in the Appendix.
In the case of Monte Carlo the score d” was calculated as

$$d'' = \frac{\left| v - \mu_N \right|}{\sigma_N}$$

where $v$ is the discrete total variation of the scored list’s
histogram, N is the size of the list, and $\mu_N$ and $\sigma_N$ are
the mean and the standard deviation,
respectively, of the discrete total variation obtained for a large
number of subsamples of size N of the German census of
1939. The distribution of the values of d” obtained for all
lists from the USHMM is shown in Fig. 17. The lists with
the first and second highest value of d” were the same as for
d’ with d” ≈ 58.14 and d” ≈ 49.04, while the list with the
third highest value of d” was the list of Jewish refugees in
Tashkent [45] with d” ≈ 44.51 and with the corresponding
histogram shown in Fig. 16d. Already by looking at the
mentioned figures for the top-scoring lists it can be seen that
their corresponding histograms indeed have high values of
discrete total variation with spikes, i.e. individual bins that
significantly differ from their neighbors, which contrasts with the
smoothness of the histogram for the German census of 1939.

FIGURE 14. Applying the proposed method to 7106 lists of the USHMM data. The model based on applying the Monte Carlo
simulation to the German census of 1939 is shown for comparison. Note that the plot axes use the logarithmic scale.

FIGURE 15. The distribution of values d’ calculated by the proposed
method for birth years from the USHMM lists.
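
The Monte Carlo score, i.e. the distance of a histogram's discrete total variation from the Monte Carlo mean expressed in standard deviations, can be sketched as follows. This is an illustration under stated assumptions: the reference histogram is hypothetical, the subsamples are drawn with replacement from the reference bin probabilities (an approximation of subsampling the reference data), and the discrete total variation is assumed to be the sum of absolute differences of adjacent bins.

```python
import numpy as np

def dtv(hist):
    """Assumed DTV: sum of absolute differences of adjacent bins."""
    return int(np.abs(np.diff(hist)).sum())

def monte_carlo_score(hist, reference_counts, trials=2000, seed=0):
    """Distance of a histogram's DTV from the Monte Carlo mean, in std units."""
    rng = np.random.default_rng(seed)
    hist = np.asarray(hist)
    p = reference_counts / reference_counts.sum()
    # Simulated histograms with the same sample size as the scored one.
    sims = rng.multinomial(int(hist.sum()), p, size=trials)
    dtvs = np.abs(np.diff(sims, axis=1)).sum(axis=1)
    return abs(dtv(hist) - dtvs.mean()) / dtvs.std()

# Hypothetical smooth reference histogram with 60 bins.
bins = np.arange(60)
reference = np.round(1e5 * np.exp(-0.5 * ((bins - 30) / 12.0) ** 2)) + 1

rng = np.random.default_rng(1)
regular = rng.multinomial(2000, reference / reference.sum())
jagged = regular.copy()
moved = jagged[1::2] // 2        # move half of each odd bin to its left neighbor
jagged[1::2] -= moved
jagged[::2] += moved

score_regular = monte_carlo_score(regular, reference)
score_jagged = monte_carlo_score(jagged, reference)
```

In this toy example the jagged histogram is expected to score far higher than the regular one, which mirrors how spiky lists stand out against a smooth reference.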


E. ADVANTAGES OVER EXISTING METRICS 

Like for many other groups of population histograms, there 
is no ground-truth ordering for USHMM lists in terms of 
their histograms’ smoothness or accordance with historical 
populations. Because of that, the quality of ordering obtained 
by the proposed method and by existing metrics such as 
Whipple’s and Myers’ indices cannot be compared directly.
However, it is possible to show cases that are problematic for 
both of these indices, but not for the proposed method. 

The first example is the histograms shown in
Figs. 18a and 18b, which represent the top-scoring histograms
among the USHMM lists’ histograms for the Whipple’s
and Myers’ indices, respectively. It can be seen that
these histograms are actually relatively smooth, but they
also contain only a few non-zero values: the first one has 8
and the second one only a single one. These histograms can
hardly be considered outliers in terms of smoothness when
compared to the histograms in Fig. 16, but rather outliers
in terms of the span of covered years, which is different and also
detectable by much simpler techniques. Additionally, the lists
that produced these histograms have only a relatively small
number of birth years and since the mentioned indices, unlike
the proposed method, do not take into account the sample
size, they are also more prone to anomalies that arise in
smaller samples due to randomness.

FIGURE 16. Top-scoring birth year lists out of 7106 checked lists: a) the Jasenovac camp inmates available at USHMM’s webpage [42], b) the victims

FIGURE 17. The distribution of values d” calculated by the proposed
method for birth years from the USHMM lists.

Another problem with metrics such as Whipple’s and 
Myers’ indices is that they are mainly concerned with fre- 
quencies and do not take into account other properties such 
as shape or smoothness. Because of that, for different samples
that have the same frequencies of last digits of their numbers,
it is still possible to obtain the same values of the mentioned
indices even if the samples’ histograms differ significantly.
An example of this is given in Fig. 19 with a fully smooth
histogram that has the same index values as a histogram
that can hardly be considered smooth. While numerous sim- 
ilar examples exist, the ones presented are enough to show 
the frequency-based weakness of the Whipple’s and Myers’ 
indices. On the other hand, the proposed method has no such 
problems and its values for the histograms in Fig. 19 differ 
significantly with one being zero and the other one non-zero. 
In short, while being widely used and useful in certain 
cases, metrics such as the Whipple’s and the Myers’ indices 
are too simple to properly handle properties such as smooth- 
ness. Therefore, the proposed method’s ability to specifically 
target smoothness is its main advantage over other metrics. 
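
The frequency-based weakness discussed above can be illustrated with a small hypothetical example that is in the same spirit as, but not taken from, Fig. 19. The sketch assumes the standard form of the Whipple's index on ages 23 to 62 and the sum of absolute differences of adjacent bins as the discrete total variation; the two age histograms are invented for illustration.

```python
import numpy as np

ages = np.arange(23, 63)                         # ages 23..62

def whipple(counts):
    """Standard Whipple's index on ages 23-62 (100 means no digit preference)."""
    heaped = counts[ages % 5 == 0].sum()         # ages ending in 0 or 5
    return 100.0 * heaped / (counts.sum() / 5.0)

def digit_totals(counts):
    """Total count for each last digit 0..9."""
    return np.bincount(ages % 10, weights=counts, minlength=10)

def dtv(counts):
    """Assumed DTV: sum of absolute differences of adjacent bins."""
    return int(np.abs(np.diff(counts)).sum())

flat = np.full(40, 10)                           # perfectly smooth histogram
# Alternating decades of 20 and 0: same last-digit totals, very different shape.
blocky = np.where(((ages - 23) // 10) % 2 == 0, 20, 0)

same_digits = np.array_equal(digit_totals(flat), digit_totals(blocky))
```

Both histograms have identical last-digit totals and a Whipple's index of 100, yet their discrete total variations are 0 and 60, respectively.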


F. DISCUSSION 

Looking at the distributions shown in Figs. 15 and 17 and 
observing the significant difference between the majority of 
the scores and the highest scores, it can be concluded that 
the histograms of the used USHMM lists that obtained the 
highest scores are indeed outliers in terms of the discrete total 
variation. Since the analyzed data consisted of birth years, 
it may seem that an appropriate tool for identifying outliers 
such as the ones in Fig. 16 could be the Whipple’s index [46], 
FIGURE 18. Top-scoring birth year lists out of 7106 checked lists for a) the Whipple's index and for b) the Myers’ index.
FIGURE 20. Same data for Jasenovac inmates as in Fig. 16a, but with additional markings for the age heaping [16] artifacts.


but due to its fixed nature of checking only specific kinds
of data, it is often inappropriate [49], [50]. This also holds
in the case of the histogram of the Jasenovac inmates shown
in Fig. 16a, whose artifacts are marked more closely in Fig. 20.
FIGURE 21. The comparison of the values of the theoretical discrete total variation $\|D\|_V$ of the histograms of normal distribution $\mathcal{N}(0, \sigma)$
for the values in the interval $[-b, b]$ for various numbers of histogram bins used to obtain the experimental results that were shown earlier in
Fig. 7: a) c = 5, σ = 0.9, b) c = 5, σ = 1.5, c) c = 5, σ = 0.5, and d) c = 10, σ = 0.5.


It can be seen that age heaping occurs in several forms that
the Whipple’s index not only fails to pick up, but that also
hamper it. Namely, in its slightly changed form the
Whipple’s index checks for a surplus of years ending in 0 or 5
when compared to other years, but in the case of Jasenovac
there is also a surplus of years ending in 2. This surplus is
not checked by the Whipple’s index and it actually reduces the
relative surplus of years ending in 0 or 5, thus hampering
the Whipple’s index in detecting the unusual data patterns.
Since the proposed method has no such problems, it may
be more appropriate in situations similar to the one in this
experiment.
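The hampering effect can be illustrated with a small worked example. It assumes a simplified Whipple-style index $W$ that compares the number of years ending in 0 or 5 with one fifth of all years, where $n_d$ is the number of years ending in digit $d$. The counts are hypothetical and are not taken from the Jasenovac list.

```latex
\begin{align*}
W &= 100 \cdot \frac{n_0 + n_5}{\frac{1}{5} \sum_{d=0}^{9} n_d} \\
% heaping only at 0 and 5: n_0 = n_5 = 150, all other digits have 100
W_{0,5} &= 100 \cdot \frac{150 + 150}{\frac{1}{5} \cdot 1100} \approx 136.4 \\
% additional heaping at 2: n_2 = 250, everything else unchanged
W_{0,2,5} &= 100 \cdot \frac{150 + 150}{\frac{1}{5} \cdot 1250} = 120
\end{align*}
```

Under these assumptions, the additional surplus of years ending in 2 lowers the index from about 136.4 to 120 even though the heaping at 0 and 5 is unchanged.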

Besides all these histogram artifacts, there are other pecu- 
liarities with the Jasenovac list. Namely, if it is compared to 
other USHMM lists used here, it directly contradicts some 
of them. For example, the list available at [51] states that a 
certain Stanko Nick survived the war [52], while the Jasen- 
ovac list claims that he was killed [53], which is known to 
be wrong [54]. In another example, the list available at [55] 
states that a certain Josip Stern arrived at Auschwitz in 
1942 [56], while the Jasenovac list claims that he was killed 
in 1941 [57]. This means that the proposed method can also 
be used to detect samples that contain potentially problematic
data with properties not always shared with the usual outliers. 


G. SOURCE CODE AND DATA REPOSITORY 

The source code written in the Python programming language
and the data required to recreate the results described in this
section are publicly available in a dedicated GitHub
repository.¹ At the time of writing this paper, the census data used in
this section was publicly available at the USHMM website,
but for the sake of simplicity of recreating the results, it is
also available in the repository. While the census data also
contains other information alongside the birth years, only the
birth years were copied to the repository in order to avoid
data privacy violation for potentially still living persons. For
example, according to the Jasenovac camp inmates list [42],
which was already shown to be problematic, a certain Stojan
Razokrak [58] was allegedly killed in 1942, but a publicly
available video of him²,³ from 2012 and its transcript⁴ clearly


¹ https://github.com/DiscreteTotalVariation/TVOR
² https://www.youtube.com/watch?v=S5IRWT63as0
³ https://archive.is/48sKw
⁴ https://archive.is/RtnsJ


VOLUME 9, 2021 


N. Banić, N. Elezović: TVOR: Finding DTV Outliers Among Histograms


IEEE Access 


[Plots: six panels, (a) to (f), each showing DTV (y-axis) against sample size (x-axis).]

FIGURE 22. The DTVs of histograms of random samples drawn from $\mathcal{N}(0, \sigma^2)$ and of sizes randomly chosen to be between 1 and U. The
number of bins n and the upper size bound U are set to a) n = 5 and U = 1000, b) n = 10 and U = 1000, c) n = 25 and U = 1000, d) n = 50 and
U = 1000, e) n = 50 and U = $10^4$, and f) n = 50 and U = $10^5$. The lines represent the value of $\|D\|_V$ described in Eq. (37) and multiplied by the
sample size, while the dots represent the random samples.


show the opposite. Because of that, it seemed reasonable to 
copy only the birth years, while any interested reader can 
check the rest of the data at the USHMM website by using 
the appropriate list identifier given in the repository. 


V. CONCLUSION AND FUTURE WORK 
In this paper, a method for finding discrete total variation 
outliers among histograms has been proposed. It scores 




histograms based on the deviation of their discrete total
variation from its expected value. To carry out this scoring,
a statistical framework has been proposed. One of the
method’s main advantages is that in order to work it requires
no information about the distribution of the samples that are
being described by histograms. In some special cases the
proposed method even outperforms Pearson’s chi-squared
test when looking for the outlier histograms in terms of the
FIGURE 23. The dependence of the performance of the baseline and the
proposed method on the random samples’ size range when 100 inlier
N (0, 0.97), the number of bins n is 15, c = 5, and the size of the inlier
and the outlier samples is randomly chosen to be between L and 10 · L,
where L is the lower size bound that is shown on the x-axis.


sample distribution despite the fact that it was not designed
for this task. On the other hand, the proposed method clearly
outperforms Pearson’s chi-squared test when looking for
discrete total variation outliers, especially in cases with a
large number of outliers. Overall, the proposed method represents
a successful proof of concept of how the discrete total variation
that is used in the method’s modelling can be applied to
histogram outlier detection in terms of discrete total variation,
which has been experimentally confirmed on synthetic and
real-life data. Future work may include looking for
other histogram properties that can also be used for histogram
outlier detection in terms of smoothness in the cases
where the distribution of the histogram samples is unknown.
As for improving the proposed method, future work will
include at least two things. The first is the analysis of
variance for the discrete total variation to potentially
improve the scoring criteria. The second comprises
other aspects of the histogram’s discrete total variation that
could decrease the scores obtained for the inlier samples, but
simultaneously keep the scores obtained for the outliers high.


APPENDIX 
A. PROOFS OF THE THEOREMS 
1) PROOF OF THEOREM 1 
By the definition of F(2, N), it can be developed as follows:

$$F(2, N) = E\left[\|x_2\|_V\right] = 2^{1-N} \sum_{k=0}^{\lfloor N/2 \rfloor} \binom{N}{k} (N - 2k). \tag{53}$$

For an even N = 2r, the equality $\sum_{k=0}^{N} \binom{N}{k} = 2^N$ leads to
Since here $\lfloor (N + 1)/2 \rfloor = \lfloor (2r + 1)/2 \rfloor = r$ and $\lfloor N/2 \rfloor = \lfloor (2r + 1)/2 \rfloor = r$, it follows that Eq. (54) matches Eq. (19).
For an odd N = 2r + 1, a similar calculation as before gives

$$\sum_{k=0}^{r} \binom{2r+1}{k} (2r + 1 - 2k) = (r + 1) \binom{2r+1}{r} \tag{55}$$

To avoid any possible confusion, it has to be mentioned
that the lower index of the binomial coefficient in Eq. (55)
can also be set to r + 1 because N is supposed to be odd
there. Furthermore, like in the previous case, it can be seen
that Eq. (55) also matches Eq. (19), which proves Theorem 1.
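The odd-case identity and the resulting closed form can be checked numerically with exact integer arithmetic. The sketch below rests on two assumptions that are not restated in this appendix: that Theorem 1 concerns two equiprobable bins, so that the DTV of a histogram with k samples in the first bin is |N - 2k|, and that the closed form is $2^{1-N} \lfloor (N+1)/2 \rfloor \binom{N}{\lfloor N/2 \rfloor}$, which may differ in presentation from Eq. (19). All function names are illustrative.

```python
from fractions import Fraction
from math import comb


def brute_force_expectation(N):
    """E|k - (N - k)| for k ~ Binomial(N, 1/2), i.e. two equiprobable bins."""
    total = sum(comb(N, k) * abs(N - 2 * k) for k in range(N + 1))
    return Fraction(total, 2 ** N)


def closed_form(N):
    """Assumed closed form: 2^(1-N) * floor((N+1)/2) * C(N, floor(N/2))."""
    return Fraction(2 * ((N + 1) // 2) * comb(N, N // 2), 2 ** N)


def odd_identity_holds(r):
    """Partial-sum identity for N = 2r + 1 with lower index r and with r + 1."""
    N = 2 * r + 1
    lhs = sum(comb(N, k) * (N - 2 * k) for k in range(r + 1))
    return lhs == (r + 1) * comb(N, r) == (r + 1) * comb(N, r + 1)


if __name__ == "__main__":
    print(all(odd_identity_holds(r) for r in range(40)))
    print(all(brute_force_expectation(N) == closed_form(N) for N in range(1, 60)))
```

The second comparison in `odd_identity_holds` also reflects the remark that the lower index of the binomial coefficient can be set to r + 1 for odd N.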


2) PROOF OF THEOREM 2 
The expectation $E[|x_2 - x_1|]$ can be written as follows:

$$E\left[|x_2 - x_1|\right] = n^{-N} \sum_{k_1 + \cdots + k_n = N} \binom{N}{k_1, \ldots, k_n} |k_2 - k_1| \tag{56}$$

The right-hand side of Eq. (56) can further be written as
To obtain Eq. (23) from here, it is sufficient to note that

$$\begin{aligned}
E\left[|x_2 - x_1|\right] &= E[x_2 - x_1 \mid x_2 > x_1] + E[x_1 - x_2 \mid x_2 < x_1] \\
&= E[x_2 - x_1 \mid x_2 > x_1] + E[x_2 - x_1 \mid x_1 < x_2] \\
&= 2E[x_2 - x_1 \mid x_2 > x_1].
\end{aligned} \tag{58}$$


Eq. (58) can be applied to Eq. (57), which can then be applied 
to Eq. (21). This results in Eq. (23), which proves Theorem 2. 


B. THEORETICAL DISCRETE TOTAL VARIATIONS 

To facilitate a better understanding of the experimental results
that were discussed in Section IV-B and shown in Fig. 7,
a comparison of the values of the theoretical discrete total
variation $\|D\|_V$ of the histograms of normal distributions with
the parameters used to obtain these results is given in Fig. 21. By
comparing Figs. 7 and 21, it is relatively easy to explain
phenomena such as the sudden drops in the proposed method’s
FIGURE 24. Birth year histograms of Jasenovac camp inmates [42] by nationality with markings for age heaping: a) Roma inmates, b) Jewish
inmates, c) Croatian inmates, and d) Muslim inmates. Only the histograms for nationalities for which there are more than 1000 listed inmates are
shown here, while the histogram for the Serbian inmates is given separately in Fig. 25.


performance that can be seen in Fig. 7d when 10 bins are used. 
Namely, Fig. 21d clearly shows that for 10 bins the difference 
between the theoretical DTVs of the distributions used there 
is very small, which renders the proposed method inadequate 
for recognizing outlier samples for that specific case. Similar 
reasoning can also be applied to successful cases where this 
difference is sufficiently large. 
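A theoretical DTV of this kind can be approximated with a short sketch. It makes several assumptions that are not confirmed by this appendix: the second parameter of the normal distribution is treated as the standard deviation, the interval [-b, b] is split into n bins of equal width, the bin probabilities are renormalized to sum to one inside the interval, and the DTV is the sum of absolute differences between neighboring bin probabilities. The function names are illustrative.

```python
from math import erf, sqrt


def normal_cdf(x, sigma):
    """CDF of a zero-mean normal distribution with standard deviation sigma."""
    return 0.5 * (1.0 + erf(x / (sigma * sqrt(2.0))))


def theoretical_dtv(sigma, b, n):
    """DTV of the bin probabilities on [-b, b] split into n equal-width bins."""
    edges = [-b + 2.0 * b * i / n for i in range(n + 1)]
    cdf = [normal_cdf(e, sigma) for e in edges]
    mass = cdf[-1] - cdf[0]  # probability mass that falls inside [-b, b]
    probs = [(hi - lo) / mass for lo, hi in zip(cdf, cdf[1:])]
    return sum(abs(q - p) for p, q in zip(probs, probs[1:]))


if __name__ == "__main__":
    # Compare two hypothetical standard deviations over several bin counts.
    for n in (5, 10, 15, 20):
        first = theoretical_dtv(0.5, 5.0, n)
        second = theoretical_dtv(0.9, 5.0, n)
        print(n, round(first, 4), round(second, 4), round(abs(first - second), 4))
```

Printing the absolute difference for each bin count gives a quick way to spot bin counts for which two distributions have nearly the same theoretical DTV and are therefore hard to tell apart by DTV alone.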


C. DEPENDENCE OF VARIATION ON THE SAMPLE SIZE 
Fig. 9 clearly shows how randomness can have a significant 
impact on the performance of the proposed method. Never- 
theless, as described by Eq. (48), when the samples’ sizes 
grow, this impact becomes ever smaller. However, in order 
to decrease this impact in cases of e.g. larger values of n, 
the samples’ sizes have to grow significantly more than in 
the cases of smaller values of n. This is illustrated on several 
examples shown in Fig. 22. There it can be seen that for 
n = 5 the samples with random sizes up to 1000 are clearly 
separated, while for the same sizes and n = 50 the samples 
can hardly be separated. However, as shown in Fig. 22e 
and Fig. 22f, if the upper bound for the sizes of random 
samples gets increased even further, the separation again 
becomes clear. As shown in Fig. 23, this has a direct influence 
on the performance of both the baseline and the proposed 
methods. 
In short, a successful application of the proposed method
assumes that the sizes of samples are reasonably large relative
to the number of bins n. How large they should be,
however, depends on the specific distributions of the samples.
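The interplay between the number of bins and the sample size can be explored with a small simulation. It is only a sketch under stated assumptions: samples come from a standard normal distribution, the histogram covers [-c, c] with c = 5 and equal-width bins, and the DTV is the sum of absolute differences between neighboring bins. The helper `mean_inflation` is a hypothetical measure that reports how much the sampled DTV exceeds the theoretical one on average, with values close to 1 meaning that randomness has little impact.

```python
import random
from math import erf, sqrt


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def expected_dtv(n, c=5.0):
    """DTV of the theoretical bin probabilities on [-c, c] with n bins."""
    edges = [-c + 2.0 * c * i / n for i in range(n + 1)]
    probs = [normal_cdf(hi) - normal_cdf(lo) for lo, hi in zip(edges, edges[1:])]
    return sum(abs(q - p) for p, q in zip(probs, probs[1:]))


def sampled_dtv(n, size, rng, c=5.0):
    """DTV of the bin counts of one random sample of the given size."""
    counts = [0] * n
    for _ in range(size):
        x = rng.gauss(0.0, 1.0)
        if -c <= x < c:
            counts[min(n - 1, int((x + c) / (2.0 * c) * n))] += 1
    return sum(abs(b - a) for a, b in zip(counts, counts[1:]))


def mean_inflation(n, size, trials=20, seed=0):
    """Average ratio of the sampled DTV to size times the theoretical DTV."""
    rng = random.Random(seed)
    target = size * expected_dtv(n)
    return sum(sampled_dtv(n, size, rng) / target for _ in range(trials)) / trials


if __name__ == "__main__":
    for n, size in ((5, 1000), (50, 1000), (50, 20000)):
        print(n, size, round(mean_inflation(n, size, trials=5), 3))
```

With these settings, the ratio for 50 bins and 1000 samples should be visibly larger than for 5 bins at the same size, and it should move back toward 1 once the sample size is increased.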


D. PARTITIONING THE TOP-SCORING HISTOGRAM 

In order to describe the behavior of the proposed method in
more detail, it may be useful to additionally analyze the
top-scoring histogram shown in Figs. 16a and 20. By partitioning
the initial birth year sample into several smaller samples, it
is possible to examine the behavior of the proposed method
when the sample size changes. One way of partitioning the
sample is by the nationality of the inmates. The nationalities for
which there are more than 1000 listed inmates are, as specified
in the Jasenovac inmates list, the following ones: Serbian,
Roma, Jewish, Croatian, and Muslim. While the histograms
of the Roma, Jewish, Croatian, and Muslim nationalities
shown in Fig. 24 all exhibit signs of age heaping similar to
the ones in Fig. 20, by far the most prominent signs are exhibited
by the Serbian nationality, as shown in Fig. 25.

If the histograms for separate nationalities are also added 
to the set of USHMM lists and the proposed method is 
applied to this extended set, then the histogram for the Serbian 
nationality ends up being the second most likely outlier just 
after the whole Jasenovac list with d’ = 40.82. The Romani 
FIGURE 25. Birth year histogram of Jasenovac inmates of Serbian nationality with same markings for age heaping as in Fig. 20.


nationality histogram ends up in 21st place with d’ =
15.12, the Jewish nationality histogram ends up in 66th
place with d’ = 9.08, while the other histograms are not among
the 100 most likely outliers. This shows how the proposed
method can also be used to detect the potentially problematic
parts of a sample, which in the case of the Jasenovac list lie
in the birth years of Serbian inmates.

Additionally, there is another thing to be observed here. 
Namely, while Figs. 20 and 25 seem to be very similar, the 
score d’ for the histogram of the birth years of the Serbian 
inmates was nevertheless smaller than the one for the whole 
Jasenovac list. This has to do with the fact that the sample with 
birth years of Serbian inmates has fewer values than the whole 
Jasenovac list, i.e. it makes up roughly 57% of the Jasenovac 
list. Because of that, such similar deviations are considered to 
be less likely on a larger sample and thus the whole Jasenovac 
list has a slightly larger value of score d’. 
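The role of the sample size can be illustrated with a simulation. It does not use the paper's score d’. Instead, it uses a hypothetical standardized deviation: the normalized DTV of a heaped sample is compared with the mean and standard deviation of the normalized DTVs of clean samples of the same size. The uniform distribution over 50 years and the heaping share of 20% are made-up assumptions.

```python
import random
from statistics import mean, stdev

BINS = 50  # hypothetical range of consecutive birth years, indexed 0..49


def normalized_dtv(values):
    """Sum of absolute neighboring-bin differences divided by the sample size."""
    counts = [0] * BINS
    for v in values:
        counts[v] += 1
    return sum(abs(b - a) for a, b in zip(counts, counts[1:])) / len(values)


def clean_sample(size, rng):
    """Birth years drawn uniformly, without any heaping."""
    return [rng.randrange(BINS) for _ in range(size)]


def heaped_sample(size, rng, share=0.2):
    """With probability `share`, a year is rounded down to a year ending in 0."""
    return [v - v % 10 if rng.random() < share else v
            for v in clean_sample(size, rng)]


def standardized_deviation(size, trials=50, seed=1):
    """How many reference standard deviations the heaped DTV lies above clean DTVs."""
    rng = random.Random(seed)
    reference = [normalized_dtv(clean_sample(size, rng)) for _ in range(trials)]
    observed = normalized_dtv(heaped_sample(size, rng))
    return (observed - mean(reference)) / stdev(reference)


if __name__ == "__main__":
    for size in (1000, 9000):
        print(size, round(standardized_deviation(size), 1))
```

In this sketch the same relative amount of heaping yields a larger standardized deviation for the larger sample, because the spread of the normalized DTV among clean samples shrinks as the sample size grows.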